
Gradient Descent

Code

Please visit this page for a code implementation of the fixed-step-size gradient method and this page for a code implementation of the steepest descent method.

Gradient Descent

Suppose that we are given a point $\boldsymbol{x}^{(k)}$. To find the next point $\boldsymbol{x}^{(k+1)}$, we start at $\boldsymbol{x}^{(k)}$ and move by an amount $-\alpha_k \nabla f\left(\boldsymbol{x}^{(k)}\right)$, where $\alpha_k$ is a positive scalar called the step size. This procedure leads to the following iterative algorithm:

$$\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha_k \nabla f\left(\boldsymbol{x}^{(k)}\right).$$

We refer to this as a gradient descent algorithm (or simply a gradient algorithm).
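
As a concrete illustration, here is a minimal Python sketch of this update rule. The objective, its gradient, the starting point, and the step-size value below are placeholder choices for the example, not part of the text above.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, num_iters=100, tol=1e-8):
    """Run the iteration x_{k+1} = x_k - alpha_k * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stop when the gradient is (nearly) zero
            break
        x = x - step_size * g         # move against the gradient
    return x

# Example: f(x) = x_1^2 + 2 x_2^2, so grad f(x) = (2 x_1, 4 x_2)
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
print(gradient_descent(grad_f, x0=[3.0, -2.0]))  # approaches the minimizer (0, 0)
```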

The Method of Steepest Descent

The method of steepest descent is a gradient algorithm where the step size $\alpha_k$ is chosen to achieve the maximum amount of decrease of the objective function at each individual step. Specifically, $\alpha_k$ is chosen to minimize $\phi_k(\alpha) \triangleq f\left(\boldsymbol{x}^{(k)}-\alpha \nabla f\left(\boldsymbol{x}^{(k)}\right)\right)$. In other words,

$$\alpha_k=\underset{\alpha \geq 0}{\arg \min }\, f\left(\boldsymbol{x}^{(k)}-\alpha \nabla f\left(\boldsymbol{x}^{(k)}\right)\right).$$
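
Compared with the fixed-step-size sketch above, the only change is that each iteration now solves the one-dimensional problem for $\alpha_k$ numerically. Below is a sketch using SciPy's scalar minimizer for the line search; the bracket $(0, 10)$ for $\alpha$ is an arbitrary choice made for this example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad_f, x0, num_iters=50, tol=1e-8):
    """Steepest descent: at each step choose alpha_k minimizing phi_k(alpha)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        phi = lambda alpha: f(x - alpha * g)          # phi_k(alpha)
        res = minimize_scalar(phi, bounds=(0.0, 10.0), method="bounded")
        x = x - res.x * g                             # res.x is alpha_k
    return x

f = lambda x: x[0] ** 2 + 2.0 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
print(steepest_descent(f, grad_f, x0=[3.0, -2.0]))    # approaches (0, 0)
```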

Observe that the method of steepest descent moves in orthogonal steps, as stated in the following proposition.

Proposition

If $\left\{\boldsymbol{x}^{(k)}\right\}_{k=0}^{\infty}$ is a steepest descent sequence for a given function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then for each $k$ the vector $\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}$ is orthogonal to the vector $\boldsymbol{x}^{(k+2)}-\boldsymbol{x}^{(k+1)}$.

Proof

From the iterative formula of the method of steepest descent it follows that

$$\left\langle\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}, \boldsymbol{x}^{(k+2)}-\boldsymbol{x}^{(k+1)}\right\rangle=\alpha_k \alpha_{k+1}\left\langle\nabla f\left(\boldsymbol{x}^{(k)}\right), \nabla f\left(\boldsymbol{x}^{(k+1)}\right)\right\rangle.$$

To complete the proof it is enough to show that

$$\left\langle\nabla f\left(\boldsymbol{x}^{(k)}\right), \nabla f\left(\boldsymbol{x}^{(k+1)}\right)\right\rangle=0.$$

To this end, observe that $\alpha_k$ is a nonnegative scalar that minimizes $\phi_k(\alpha) \triangleq f\left(\boldsymbol{x}^{(k)}-\alpha \nabla f\left(\boldsymbol{x}^{(k)}\right)\right)$. Hence, applying the FONC (first-order necessary condition) and the chain rule gives us

$$\begin{aligned} 0 &=\phi_k^{\prime}\left(\alpha_k\right) \\ &=\frac{d \phi_k}{d \alpha}\left(\alpha_k\right) \\ &=\nabla f\left(\boldsymbol{x}^{(k)}-\alpha_k \nabla f\left(\boldsymbol{x}^{(k)}\right)\right)^{\top}\left(-\nabla f\left(\boldsymbol{x}^{(k)}\right)\right) \\ &=-\left\langle\nabla f\left(\boldsymbol{x}^{(k+1)}\right), \nabla f\left(\boldsymbol{x}^{(k)}\right)\right\rangle, \end{aligned}$$

which completes the proof.

The proposition above implies that $\nabla f\left(\boldsymbol{x}^{(k)}\right)$ is parallel to the tangent plane to the level set $\left\{f(\boldsymbol{x})=f\left(\boldsymbol{x}^{(k+1)}\right)\right\}$ at $\boldsymbol{x}^{(k+1)}$. Note that as each new point is generated by the steepest descent algorithm, the corresponding value of $f$ decreases, as stated below.

Proposition

If $\left\{\boldsymbol{x}^{(k)}\right\}_{k=0}^{\infty}$ is the steepest descent sequence for $f: \mathbb{R}^n \rightarrow \mathbb{R}$ and if $\nabla f\left(\boldsymbol{x}^{(k)}\right) \neq \mathbf{0}$, then $f\left(\boldsymbol{x}^{(k+1)}\right)<f\left(\boldsymbol{x}^{(k)}\right)$.

Proof

First recall that

$$\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha_k \nabla f\left(\boldsymbol{x}^{(k)}\right),$$

where $\alpha_k \geq 0$ is the minimizer of

$$\phi_k(\alpha)=f\left(\boldsymbol{x}^{(k)}-\alpha \nabla f\left(\boldsymbol{x}^{(k)}\right)\right)$$

over all $\alpha \geq 0$. Thus, for $\alpha \geq 0$, we have

$$\phi_k\left(\alpha_k\right) \leq \phi_k(\alpha).$$

By the chain rule,

$$\phi_k^{\prime}(0)=\frac{d \phi_k}{d \alpha}(0)=-\nabla f\left(\boldsymbol{x}^{(k)}-0\, \nabla f\left(\boldsymbol{x}^{(k)}\right)\right)^{\top} \nabla f\left(\boldsymbol{x}^{(k)}\right)=-\left\|\nabla f\left(\boldsymbol{x}^{(k)}\right)\right\|^2<0$$

because $\nabla f\left(\boldsymbol{x}^{(k)}\right) \neq \mathbf{0}$ by assumption. Thus, $\phi_k^{\prime}(0)<0$, which implies that there is an $\bar{\alpha}>0$ such that $\phi_k(0)>\phi_k(\alpha)$ for all $\alpha \in(0, \bar{\alpha}]$. Hence,

$$f\left(\boldsymbol{x}^{(k+1)}\right)=\phi_k\left(\alpha_k\right) \leq \phi_k(\bar{\alpha})<\phi_k(0)=f\left(\boldsymbol{x}^{(k)}\right),$$

which completes the proof.

Analysis of Gradient Methods

Let us now see what the method of steepest descent does with a quadratic function of the form

$$f(\boldsymbol{x})=\frac{1}{2} \boldsymbol{x}^{\top} \boldsymbol{Q} \boldsymbol{x}-\boldsymbol{b}^{\top} \boldsymbol{x},$$

where $\boldsymbol{Q} \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix, $\boldsymbol{b} \in \mathbb{R}^n$, and $\boldsymbol{x} \in \mathbb{R}^n$. The unique minimizer of $f$ can be found by setting the gradient of $f$ to zero, where

$$\nabla f(\boldsymbol{x})=\boldsymbol{Q} \boldsymbol{x}-\boldsymbol{b},$$

because $D\left(\boldsymbol{x}^{\top} \boldsymbol{Q} \boldsymbol{x}\right)=\boldsymbol{x}^{\top}\left(\boldsymbol{Q}+\boldsymbol{Q}^{\top}\right)=2 \boldsymbol{x}^{\top} \boldsymbol{Q}$ and $D\left(\boldsymbol{b}^{\top} \boldsymbol{x}\right)=\boldsymbol{b}^{\top}$. There is no loss of generality in assuming $\boldsymbol{Q}$ to be a symmetric matrix. For if we are given a quadratic form $\boldsymbol{x}^{\top} \boldsymbol{A} \boldsymbol{x}$ with $\boldsymbol{A} \neq \boldsymbol{A}^{\top}$, then because the transposition of a scalar equals itself, we obtain

$$\left(\boldsymbol{x}^{\top} \boldsymbol{A} \boldsymbol{x}\right)^{\top}=\boldsymbol{x}^{\top} \boldsymbol{A}^{\top} \boldsymbol{x}=\boldsymbol{x}^{\top} \boldsymbol{A} \boldsymbol{x}.$$

Hence,

$$\begin{aligned} \boldsymbol{x}^{\top} \boldsymbol{A} \boldsymbol{x} &=\frac{1}{2} \boldsymbol{x}^{\top} \boldsymbol{A} \boldsymbol{x}+\frac{1}{2} \boldsymbol{x}^{\top} \boldsymbol{A}^{\top}\boldsymbol{x} \\ &=\frac{1}{2} \boldsymbol{x}^{\top}\left(\boldsymbol{A}+\boldsymbol{A}^{\top}\right) \boldsymbol{x} \\ & \triangleq \frac{1}{2} \boldsymbol{x}^{\top} \boldsymbol{Q} \boldsymbol{x}. \end{aligned}$$

Note that

$$\left(\boldsymbol{A}+\boldsymbol{A}^{\top}\right)^{\top}=\boldsymbol{Q}^{\top}=\boldsymbol{A}+\boldsymbol{A}^{\top}=\boldsymbol{Q}.$$
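
A quick numerical check of this symmetrization; the particular matrix $\boldsymbol{A}$ and vector $\boldsymbol{x}$ below are arbitrary example data:

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [0.0, 3.0]])          # not symmetric
Q = A + A.T                          # symmetrized version
x = np.array([2.0, -1.0])

print(x @ A @ x)                     # x^T A x
print(0.5 * x @ Q @ x)               # (1/2) x^T Q x -- same value
```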

The Hessian of $f$ is $\boldsymbol{F}(\boldsymbol{x})=\boldsymbol{Q}=\boldsymbol{Q}^{\top}>0$. To simplify the notation we write $\boldsymbol{g}^{(k)}=\nabla f\left(\boldsymbol{x}^{(k)}\right)$. Then, the steepest descent algorithm for the quadratic function can be represented as

$$\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha_k \boldsymbol{g}^{(k)},$$

where

$$\begin{aligned} \alpha_k &=\underset{\alpha \geq 0}{\arg \min }\, f\left(\boldsymbol{x}^{(k)}-\alpha \boldsymbol{g}^{(k)}\right) \\ &=\underset{\alpha \geq 0}{\arg \min }\left(\frac{1}{2}\left(\boldsymbol{x}^{(k)}-\alpha \boldsymbol{g}^{(k)}\right)^{\top} \boldsymbol{Q}\left(\boldsymbol{x}^{(k)}-\alpha \boldsymbol{g}^{(k)}\right)-\left(\boldsymbol{x}^{(k)}-\alpha \boldsymbol{g}^{(k)}\right)^{\top} \boldsymbol{b}\right). \end{aligned}$$

In the quadratic case, we can find an explicit formula for $\alpha_k$. We proceed as follows. Assume that $\boldsymbol{g}^{(k)} \neq \mathbf{0}$, for if $\boldsymbol{g}^{(k)}=\mathbf{0}$, then $\boldsymbol{x}^{(k)}=\boldsymbol{x}^*$ and the algorithm stops. Because $\alpha_k \geq 0$ is a minimizer of $\phi_k(\alpha)=f\left(\boldsymbol{x}^{(k)}-\alpha \boldsymbol{g}^{(k)}\right)$, we apply the FONC to $\phi_k(\alpha)$ to obtain

$$\phi_k^{\prime}(\alpha)=\left(\boldsymbol{x}^{(k)}-\alpha \boldsymbol{g}^{(k)}\right)^{\top} \boldsymbol{Q}\left(-\boldsymbol{g}^{(k)}\right)-\boldsymbol{b}^{\top}\left(-\boldsymbol{g}^{(k)}\right).$$

Therefore, $\phi_k^{\prime}(\alpha)=0$ if $\alpha \boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}=\left(\boldsymbol{x}^{(k)\top} \boldsymbol{Q}-\boldsymbol{b}^{\top}\right) \boldsymbol{g}^{(k)}$. But

$$\boldsymbol{x}^{(k)\top} \boldsymbol{Q}-\boldsymbol{b}^{\top}=\boldsymbol{g}^{(k)\top}.$$

Hence,

$$\alpha_k=\frac{\boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}}{\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}}.$$

In summary, the method of steepest descent for the quadratic takes the form

$$\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\frac{\boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}}{\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}}\, \boldsymbol{g}^{(k)},$$

where

$$\boldsymbol{g}^{(k)}=\nabla f\left(\boldsymbol{x}^{(k)}\right)=\boldsymbol{Q} \boldsymbol{x}^{(k)}-\boldsymbol{b}.$$
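
Putting the pieces together, here is a minimal sketch of steepest descent for the quadratic case with this closed-form step size; the specific $\boldsymbol{Q}$, $\boldsymbol{b}$, and starting point are arbitrary example data. It also prints the inner product of consecutive steps, which should be numerically zero, illustrating the orthogonality proposition above.

```python
import numpy as np

def steepest_descent_quadratic(Q, b, x0, num_iters=20):
    """Steepest descent for f(x) = 0.5 x^T Q x - b^T x with exact alpha_k."""
    x = np.asarray(x0, dtype=float)
    prev_step = None
    for _ in range(num_iters):
        g = Q @ x - b                       # g^(k) = Q x^(k) - b
        if np.linalg.norm(g) < 1e-12:
            break
        alpha = (g @ g) / (g @ Q @ g)       # closed-form step size
        step = -alpha * g
        if prev_step is not None:
            print("consecutive steps, inner product:", prev_step @ step)
        x = x + step
        prev_step = step
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])      # symmetric positive definite
b = np.array([1.0, 2.0])
x_star = steepest_descent_quadratic(Q, b, x0=[0.0, 0.0])
print(x_star, np.linalg.solve(Q, b))        # both approximate Q^{-1} b
```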

Convergence

We can investigate important convergence characteristics of a gradient method by applying the method to quadratic problems. The convergence analysis is more convenient if, instead of working with $f$, we deal with

$$V(\boldsymbol{x})=f(\boldsymbol{x})+\frac{1}{2} \boldsymbol{x}^{* \top} \boldsymbol{Q} \boldsymbol{x}^*=\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{x}^*\right)^{\top} \boldsymbol{Q}\left(\boldsymbol{x}-\boldsymbol{x}^*\right),$$

where $\boldsymbol{Q}=\boldsymbol{Q}^{\top}>0$. The solution point $\boldsymbol{x}^*$ is obtained by solving $\boldsymbol{Q} \boldsymbol{x}=\boldsymbol{b}$; that is, $\boldsymbol{x}^*=\boldsymbol{Q}^{-1} \boldsymbol{b}$. The function $V$ differs from $f$ only by the constant $\frac{1}{2} \boldsymbol{x}^{* \top} \boldsymbol{Q} \boldsymbol{x}^*$. We begin our analysis with the following useful lemma, which applies to a general gradient algorithm.

Lemma 1

The iterative algorithm

$$\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha_k \boldsymbol{g}^{(k)}$$

with $\boldsymbol{g}^{(k)}=\boldsymbol{Q} \boldsymbol{x}^{(k)}-\boldsymbol{b}$ satisfies

$$V\left(\boldsymbol{x}^{(k+1)}\right)=\left(1-\gamma_k\right) V\left(\boldsymbol{x}^{(k)}\right),$$

where if $\boldsymbol{g}^{(k)}=\mathbf{0}$, then $\gamma_k=1$, and if $\boldsymbol{g}^{(k)} \neq \mathbf{0}$, then

$$\gamma_k=\alpha_k \frac{\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}}{\boldsymbol{g}^{(k)\top} \boldsymbol{Q}^{-1} \boldsymbol{g}^{(k)}}\left(2 \frac{\boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}}{\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}}-\alpha_k\right).$$

Proof

The proof is by direct computation. Note that if $\boldsymbol{g}^{(k)}=\mathbf{0}$, then the desired result holds trivially. In the remainder of the proof, assume that $\boldsymbol{g}^{(k)} \neq \mathbf{0}$. We first evaluate the expression

$$\frac{V\left(\boldsymbol{x}^{(k)}\right)-V\left(\boldsymbol{x}^{(k+1)}\right)}{V\left(\boldsymbol{x}^{(k)}\right)}.$$

To facilitate computations, let $\boldsymbol{y}^{(k)}=\boldsymbol{x}^{(k)}-\boldsymbol{x}^*$. Then, $V\left(\boldsymbol{x}^{(k)}\right)=\frac{1}{2} \boldsymbol{y}^{(k)\top} \boldsymbol{Q} \boldsymbol{y}^{(k)}$. Hence,

$$\begin{aligned} V\left(\boldsymbol{x}^{(k+1)}\right) &=\frac{1}{2}\left(\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^*\right)^{\top} \boldsymbol{Q}\left(\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^*\right) \\ &=\frac{1}{2}\left(\boldsymbol{x}^{(k)}-\boldsymbol{x}^*-\alpha_k \boldsymbol{g}^{(k)}\right)^{\top} \boldsymbol{Q}\left(\boldsymbol{x}^{(k)}-\boldsymbol{x}^*-\alpha_k \boldsymbol{g}^{(k)}\right) \\ &=\frac{1}{2} \boldsymbol{y}^{(k)\top} \boldsymbol{Q} \boldsymbol{y}^{(k)}-\alpha_k \boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{y}^{(k)}+\frac{1}{2} \alpha_k^2 \boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}. \end{aligned}$$

Therefore,

$$\frac{V\left(\boldsymbol{x}^{(k)}\right)-V\left(\boldsymbol{x}^{(k+1)}\right)}{V\left(\boldsymbol{x}^{(k)}\right)}=\frac{2 \alpha_k \boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{y}^{(k)}-\alpha_k^2 \boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}}{\boldsymbol{y}^{(k)\top} \boldsymbol{Q} \boldsymbol{y}^{(k)}}.$$

Because

$$\boldsymbol{g}^{(k)}=\boldsymbol{Q} \boldsymbol{x}^{(k)}-\boldsymbol{b}=\boldsymbol{Q} \boldsymbol{x}^{(k)}-\boldsymbol{Q} \boldsymbol{x}^*=\boldsymbol{Q} \boldsymbol{y}^{(k)},$$

we have

$$\begin{aligned} &\boldsymbol{y}^{(k)\top} \boldsymbol{Q} \boldsymbol{y}^{(k)}=\boldsymbol{g}^{(k)\top} \boldsymbol{Q}^{-1} \boldsymbol{g}^{(k)}, \\ &\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{y}^{(k)}=\boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}. \end{aligned}$$

Therefore, substituting the above, we get

$$\frac{V\left(\boldsymbol{x}^{(k)}\right)-V\left(\boldsymbol{x}^{(k+1)}\right)}{V\left(\boldsymbol{x}^{(k)}\right)}=\alpha_k \frac{\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}}{\boldsymbol{g}^{(k)\top} \boldsymbol{Q}^{-1} \boldsymbol{g}^{(k)}}\left(2 \frac{\boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}}{\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}}-\alpha_k\right)=\gamma_k.$$

Note that $\gamma_k \leq 1$ for all $k$, because $\gamma_k=1-V\left(\boldsymbol{x}^{(k+1)}\right) / V\left(\boldsymbol{x}^{(k)}\right)$ and $V$ is a nonnegative function. If $\gamma_k=1$ for some $k$, then $V\left(\boldsymbol{x}^{(k+1)}\right)=0$, which is equivalent to $\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^*$. In this case we also have that for all $i \geq k+1$, $\boldsymbol{x}^{(i)}=\boldsymbol{x}^*$ and $\gamma_i=1$. It turns out that $\gamma_k=1$ if and only if either $\boldsymbol{g}^{(k)}=\mathbf{0}$ or $\boldsymbol{g}^{(k)}$ is an eigenvector of $\boldsymbol{Q}$ (see Lemma 3).
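
As a quick sanity check of Lemma 1 (not part of the proof), the identity can be verified numerically; the matrix, vector, point, and step size below are arbitrary example data.

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])      # symmetric positive definite
b = np.array([1.0, 2.0])
x_star = np.linalg.solve(Q, b)

V = lambda x: 0.5 * (x - x_star) @ Q @ (x - x_star)

x = np.array([2.0, -1.0])
alpha = 0.1                                  # any step size
g = Q @ x - b
x_next = x - alpha * g

gamma = alpha * (g @ Q @ g) / (g @ np.linalg.solve(Q, g)) \
        * (2 * (g @ g) / (g @ Q @ g) - alpha)
print(V(x_next), (1 - gamma) * V(x))         # the two values agree
```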

We are now ready to state and prove our key convergence theorem for gradient methods. The theorem gives a necessary and sufficient condition for the sequence $\left\{\boldsymbol{x}^{(k)}\right\}$ generated by a gradient method to converge to $\boldsymbol{x}^*$; that is, $\boldsymbol{x}^{(k)} \rightarrow \boldsymbol{x}^*$ or $\lim_{k \rightarrow \infty} \boldsymbol{x}^{(k)}=\boldsymbol{x}^*$.

Theorem 1

Let $\left\{\boldsymbol{x}^{(k)}\right\}$ be the sequence resulting from a gradient algorithm $\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha_k \boldsymbol{g}^{(k)}$. Let $\gamma_k$ be as defined in Lemma 1, and suppose that $\gamma_k>0$ for all $k$. Then, $\left\{\boldsymbol{x}^{(k)}\right\}$ converges to $\boldsymbol{x}^*$ for any initial condition $\boldsymbol{x}^{(0)}$ if and only if

$$\sum_{k=0}^{\infty} \gamma_k=\infty.$$

Proof

From Lemma 1 we have $V\left(\boldsymbol{x}^{(k+1)}\right)=\left(1-\gamma_k\right) V\left(\boldsymbol{x}^{(k)}\right)$, from which we obtain

$$V\left(\boldsymbol{x}^{(k)}\right)=\left(\prod_{i=0}^{k-1}\left(1-\gamma_i\right)\right) V\left(\boldsymbol{x}^{(0)}\right).$$

Assume that $\gamma_k<1$ for all $k$, for otherwise the result holds trivially. Note that $\boldsymbol{x}^{(k)} \rightarrow \boldsymbol{x}^*$ if and only if $V\left(\boldsymbol{x}^{(k)}\right) \rightarrow 0$. By the equation above, we see that this occurs if and only if $\prod_{i=0}^{\infty}\left(1-\gamma_i\right)=0$, which, in turn, holds if and only if $\sum_{i=0}^{\infty}-\log \left(1-\gamma_i\right)=\infty$ (we get this simply by taking logs). Note that by Lemma 1, $1-\gamma_i \geq 0$, so $\log \left(1-\gamma_i\right)$ is well defined, with $\log (0)$ taken to be $-\infty$. Therefore, it remains to show that $\sum_{i=0}^{\infty}-\log \left(1-\gamma_i\right)=\infty$ if and only if

$$\sum_{i=0}^{\infty} \gamma_i=\infty.$$

We first show that $\sum_{i=0}^{\infty} \gamma_i=\infty$ implies that $\sum_{i=0}^{\infty}-\log \left(1-\gamma_i\right)=\infty$. For this, first observe that for any $x \in \mathbb{R}$ with $x>0$, we have $\log (x) \leq x-1$ (this is easy to see simply by plotting $\log (x)$ and $x-1$ versus $x$). Therefore, $\log \left(1-\gamma_i\right) \leq 1-\gamma_i-1=-\gamma_i$, and hence $-\log \left(1-\gamma_i\right) \geq \gamma_i$. Thus, if $\sum_{i=0}^{\infty} \gamma_i=\infty$, then clearly $\sum_{i=0}^{\infty}-\log \left(1-\gamma_i\right)=\infty$.

Finally, we show that $\sum_{i=0}^{\infty}-\log \left(1-\gamma_i\right)=\infty$ implies that $\sum_{i=0}^{\infty} \gamma_i=\infty$. We proceed by contraposition. Suppose that $\sum_{i=0}^{\infty} \gamma_i<\infty$. Then, it must be that $\gamma_i \rightarrow 0$. Now observe that for $x \in \mathbb{R}$ with $x \leq 1$ and $x$ sufficiently close to $1$, we have $\log (x) \geq 2(x-1)$ (as before, this is easy to see simply by plotting $\log (x)$ and $2(x-1)$ versus $x$). Therefore, for sufficiently large $i$, $\log \left(1-\gamma_i\right) \geq 2\left(1-\gamma_i-1\right)=-2 \gamma_i$, which implies that $-\log \left(1-\gamma_i\right) \leq 2 \gamma_i$. Hence, $\sum_{i=0}^{\infty} \gamma_i<\infty$ implies that $\sum_{i=0}^{\infty}-\log \left(1-\gamma_i\right)<\infty$. This completes the proof.

Lemma 2

Let $\boldsymbol{Q}=\boldsymbol{Q}^{\top}>0$ be an $n \times n$ real symmetric positive definite matrix. Then, for any $\boldsymbol{x} \in \mathbb{R}^n$, we have

$$\frac{\lambda_{\min }(\boldsymbol{Q})}{\lambda_{\max }(\boldsymbol{Q})} \leq \frac{\left(\boldsymbol{x}^{\top} \boldsymbol{x}\right)^2}{\left(\boldsymbol{x}^{\top} \boldsymbol{Q} \boldsymbol{x}\right)\left(\boldsymbol{x}^{\top} \boldsymbol{Q}^{-1} \boldsymbol{x}\right)} \leq \frac{\lambda_{\max }(\boldsymbol{Q})}{\lambda_{\min }(\boldsymbol{Q})}.$$

Proof

Applying Rayleigh's inequality and using the properties of symmetric positive definite matrices listed previously, we get

$$\frac{\left(\boldsymbol{x}^{\top} \boldsymbol{x}\right)^2}{\left(\boldsymbol{x}^{\top} \boldsymbol{Q} \boldsymbol{x}\right)\left(\boldsymbol{x}^{\top} \boldsymbol{Q}^{-1} \boldsymbol{x}\right)} \leq \frac{\|\boldsymbol{x}\|^4}{\lambda_{\min }(\boldsymbol{Q})\|\boldsymbol{x}\|^2\, \lambda_{\min }\left(\boldsymbol{Q}^{-1}\right)\|\boldsymbol{x}\|^2}=\frac{\lambda_{\max }(\boldsymbol{Q})}{\lambda_{\min }(\boldsymbol{Q})}$$

and

$$\frac{\left(\boldsymbol{x}^{\top} \boldsymbol{x}\right)^2}{\left(\boldsymbol{x}^{\top} \boldsymbol{Q} \boldsymbol{x}\right)\left(\boldsymbol{x}^{\top} \boldsymbol{Q}^{-1} \boldsymbol{x}\right)} \geq \frac{\|\boldsymbol{x}\|^4}{\lambda_{\max }(\boldsymbol{Q})\|\boldsymbol{x}\|^2\, \lambda_{\max }\left(\boldsymbol{Q}^{-1}\right)\|\boldsymbol{x}\|^2}=\frac{\lambda_{\min }(\boldsymbol{Q})}{\lambda_{\max }(\boldsymbol{Q})}.$$
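
For intuition, here is a small numerical check of Lemma 2; the matrix and vector are arbitrary example data.

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
x = np.array([2.0, -1.0])

eigvals = np.linalg.eigvalsh(Q)              # eigenvalues of a symmetric matrix, ascending
lam_min, lam_max = eigvals[0], eigvals[-1]

ratio = (x @ x) ** 2 / ((x @ Q @ x) * (x @ np.linalg.solve(Q, x)))
print(lam_min / lam_max <= ratio <= lam_max / lam_min)   # True
```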

We are now ready to establish the convergence of the steepest descent method.

Theorem 2

In the steepest descent algorithm, we have $\boldsymbol{x}^{(k)} \rightarrow \boldsymbol{x}^*$ for any $\boldsymbol{x}^{(0)}$.

Proof

If $\boldsymbol{g}^{(k)}=\mathbf{0}$ for some $k$, then $\boldsymbol{x}^{(k)}=\boldsymbol{x}^*$ and the result holds. So assume that $\boldsymbol{g}^{(k)} \neq \mathbf{0}$ for all $k$. Recall that for the steepest descent algorithm,

$$\alpha_k=\frac{\boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}}{\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}}.$$

Substituting this expression for $\alpha_k$ into the formula for $\gamma_k$ yields

$$\gamma_k=\frac{\left(\boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}\right)^2}{\left(\boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)}\right)\left(\boldsymbol{g}^{(k)\top} \boldsymbol{Q}^{-1} \boldsymbol{g}^{(k)}\right)}.$$

Note that in this case $\gamma_k>0$ for all $k$. Furthermore, by Lemma 2, we have $\gamma_k \geq \lambda_{\min }(\boldsymbol{Q}) / \lambda_{\max }(\boldsymbol{Q})>0$. Therefore, we have $\sum_{k=0}^{\infty} \gamma_k=\infty$, and hence by Theorem 1 we conclude that $\boldsymbol{x}^{(k)} \rightarrow \boldsymbol{x}^*$.

Consider now a gradient method with fixed step size; that is, $\alpha_k=\alpha \in \mathbb{R}$ for all $k$. The resulting algorithm is of the form

$$\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha \boldsymbol{g}^{(k)}.$$

We refer to the algorithm above as a fixed-step-size gradient algorithm. The algorithm is of practical interest because of its simplicity. In particular, the algorithm does not require a line search at each step to determine $\alpha_k$, because the same step size $\alpha$ is used at each step. Clearly, the convergence of the algorithm depends on the choice of $\alpha$, and we would not expect the algorithm to work for arbitrary $\alpha$. The following theorem gives a necessary and sufficient condition on $\alpha$ for convergence of the algorithm.

Theorem 3

For the fixed-step-size gradient algorithm, $\boldsymbol{x}^{(k)} \rightarrow \boldsymbol{x}^*$ for any $\boldsymbol{x}^{(0)}$ if and only if

$$0<\alpha<\frac{2}{\lambda_{\max }(\boldsymbol{Q})}.$$

Proof

$\Leftarrow$: By Rayleigh's inequality we have

$$\lambda_{\min }(\boldsymbol{Q})\, \boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)} \leq \boldsymbol{g}^{(k)\top} \boldsymbol{Q} \boldsymbol{g}^{(k)} \leq \lambda_{\max }(\boldsymbol{Q})\, \boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}$$

and

$$\boldsymbol{g}^{(k)\top} \boldsymbol{Q}^{-1} \boldsymbol{g}^{(k)} \leq \frac{1}{\lambda_{\min }(\boldsymbol{Q})} \boldsymbol{g}^{(k)\top} \boldsymbol{g}^{(k)}.$$

Therefore, substituting the above into the formula for $\gamma_k$, we get

$$\gamma_k \geq \alpha\left(\lambda_{\min }(\boldsymbol{Q})\right)^2\left(\frac{2}{\lambda_{\max }(\boldsymbol{Q})}-\alpha\right)>0.$$

Therefore, $\gamma_k>0$ for all $k$, and $\sum_{k=0}^{\infty} \gamma_k=\infty$. Hence, by Theorem 1 we conclude that $\boldsymbol{x}^{(k)} \rightarrow \boldsymbol{x}^*$.

$\Rightarrow$: We use contraposition. Suppose that either $\alpha \leq 0$ or $\alpha \geq 2 / \lambda_{\max }(\boldsymbol{Q})$. Let $\boldsymbol{x}^{(0)}$ be chosen such that $\boldsymbol{x}^{(0)}-\boldsymbol{x}^*$ is an eigenvector of $\boldsymbol{Q}$ corresponding to the eigenvalue $\lambda_{\max }(\boldsymbol{Q})$. Because

$$\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha\left(\boldsymbol{Q} \boldsymbol{x}^{(k)}-\boldsymbol{b}\right)=\boldsymbol{x}^{(k)}-\alpha\left(\boldsymbol{Q} \boldsymbol{x}^{(k)}-\boldsymbol{Q} \boldsymbol{x}^*\right),$$

we obtain

$$\begin{aligned} \boldsymbol{x}^{(k+1)}-\boldsymbol{x}^* &=\boldsymbol{x}^{(k)}-\boldsymbol{x}^*-\alpha\left(\boldsymbol{Q} \boldsymbol{x}^{(k)}-\boldsymbol{Q} \boldsymbol{x}^*\right) \\ &=\left(\boldsymbol{I}_n-\alpha \boldsymbol{Q}\right)\left(\boldsymbol{x}^{(k)}-\boldsymbol{x}^*\right) \\ &=\left(\boldsymbol{I}_n-\alpha \boldsymbol{Q}\right)^{k+1}\left(\boldsymbol{x}^{(0)}-\boldsymbol{x}^*\right) \\ &=\left(1-\alpha \lambda_{\max }(\boldsymbol{Q})\right)^{k+1}\left(\boldsymbol{x}^{(0)}-\boldsymbol{x}^*\right), \end{aligned}$$

where in the last line we used the property that $\boldsymbol{x}^{(0)}-\boldsymbol{x}^*$ is an eigenvector of $\boldsymbol{Q}$. Taking norms on both sides, we get

$$\left\|\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^*\right\|=\left|1-\alpha \lambda_{\max }(\boldsymbol{Q})\right|^{k+1}\left\|\boldsymbol{x}^{(0)}-\boldsymbol{x}^*\right\|.$$

Because $\alpha \leq 0$ or $\alpha \geq 2 / \lambda_{\max }(\boldsymbol{Q})$,

$$\left|1-\alpha \lambda_{\max }(\boldsymbol{Q})\right| \geq 1.$$

Hence, $\left\|\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^*\right\|$ cannot converge to $0$, and thus the sequence $\left\{\boldsymbol{x}^{(k)}\right\}$ does not converge to $\boldsymbol{x}^*$.
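
To see the threshold of Theorem 3 in action, the sketch below (with an arbitrary example $\boldsymbol{Q}$ and $\boldsymbol{b}$) runs the fixed-step-size iteration with one step size just below $2/\lambda_{\max}(\boldsymbol{Q})$ and one just above; only the first converges.

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = np.linalg.solve(Q, b)
lam_max = np.linalg.eigvalsh(Q)[-1]

def fixed_step_error(alpha, num_iters=200):
    x = np.zeros(2)
    for _ in range(num_iters):
        x = x - alpha * (Q @ x - b)       # fixed-step-size gradient update
    return np.linalg.norm(x - x_star)

print(fixed_step_error(0.9 * 2.0 / lam_max))   # tiny error: converges
print(fixed_step_error(1.1 * 2.0 / lam_max))   # huge error: diverges
```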

Convergence Rate

We now turn our attention to the issue of convergence rates of gradient algorithms. In particular, we focus on the steepest descent algorithm. We first present the following theorem.

Theorem 4

In the method of steepest descent applied to the quadratic function, at every step $k$ we have

$$V\left(\boldsymbol{x}^{(k+1)}\right) \leq \frac{\lambda_{\max }(\boldsymbol{Q})-\lambda_{\min }(\boldsymbol{Q})}{\lambda_{\max }(\boldsymbol{Q})} V\left(\boldsymbol{x}^{(k)}\right).$$

Proof

In the proof of Theorem 2, we showed that $\gamma_k \geq \lambda_{\min }(\boldsymbol{Q}) / \lambda_{\max }(\boldsymbol{Q})$. Therefore,

$$\frac{V\left(\boldsymbol{x}^{(k)}\right)-V\left(\boldsymbol{x}^{(k+1)}\right)}{V\left(\boldsymbol{x}^{(k)}\right)}=\gamma_k \geq \frac{\lambda_{\min }(\boldsymbol{Q})}{\lambda_{\max }(\boldsymbol{Q})},$$

and the result follows.

Theorem 4 is relevant to our consideration of the convergence rate of the steepest descent algorithm as follows. Let

$$r=\frac{\lambda_{\max }(\boldsymbol{Q})}{\lambda_{\min }(\boldsymbol{Q})}=\|\boldsymbol{Q}\|\left\|\boldsymbol{Q}^{-1}\right\|,$$

called the condition number of $\boldsymbol{Q}$. Then, it follows from Theorem 4 that

$$V\left(\boldsymbol{x}^{(k+1)}\right) \leq\left(1-\frac{1}{r}\right) V\left(\boldsymbol{x}^{(k)}\right).$$

The term $(1-1 / r)$ plays an important role in the convergence of $\left\{V\left(\boldsymbol{x}^{(k)}\right)\right\}$ to $0$ (and hence of $\left\{\boldsymbol{x}^{(k)}\right\}$ to $\boldsymbol{x}^*$). We refer to $(1-1 / r)$ as the convergence ratio. Specifically, we see that the smaller the value of $(1-1 / r)$, the smaller $V\left(\boldsymbol{x}^{(k+1)}\right)$ will be relative to $V\left(\boldsymbol{x}^{(k)}\right)$, and hence the "faster" $V\left(\boldsymbol{x}^{(k)}\right)$ converges to $0$, as indicated by the inequality above. The convergence ratio $(1-1 / r)$ decreases as $r$ decreases. If $r=1$, then $\lambda_{\max }(\boldsymbol{Q})=\lambda_{\min }(\boldsymbol{Q})$, corresponding to circular contours of $f$. In this case the algorithm converges in a single step to the minimizer. As $r$ increases, the speed of convergence of $\left\{V\left(\boldsymbol{x}^{(k)}\right)\right\}$ (and hence of $\left\{\boldsymbol{x}^{(k)}\right\}$) decreases. The increase in $r$ reflects the fact that the contours of $f$ become more eccentric. We refer the reader to [88, pp. 238, 239] for an alternative approach to the analysis above. To investigate the convergence properties of $\left\{\boldsymbol{x}^{(k)}\right\}$ further, we need the following definition.
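
The following sketch (using an arbitrary diagonal $\boldsymbol{Q}$, so the condition number can be read off directly) compares the bound $(1-1/r)$ with the per-step decrease of $V$ actually observed for steepest descent.

```python
import numpy as np

Q = np.diag([1.0, 10.0])                    # condition number r = 10
b = np.zeros(2)
x_star = np.zeros(2)                        # minimizer of f for b = 0
V = lambda x: 0.5 * (x - x_star) @ Q @ (x - x_star)

r = np.linalg.cond(Q)
print("bound on V ratio:", 1 - 1 / r)       # (1 - 1/r) = 0.9

x = np.array([1.0, 1.0])
for _ in range(5):
    g = Q @ x - b
    alpha = (g @ g) / (g @ Q @ g)
    x_new = x - alpha * g
    print("observed V ratio:", V(x_new) / V(x))   # never exceeds the bound
    x = x_new
```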

Definition

Given a sequence $\left\{\boldsymbol{x}^{(k)}\right\}$ that converges to $\boldsymbol{x}^*$, that is, $\lim_{k \rightarrow \infty}\left\|\boldsymbol{x}^{(k)}-\boldsymbol{x}^*\right\|=0$, we say that the order of convergence is $p$, where $p \in \mathbb{R}$, if

$$0<\lim _{k \rightarrow \infty} \frac{\left\|\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^*\right\|}{\left\|\boldsymbol{x}^{(k)}-\boldsymbol{x}^*\right\|^p}<\infty.$$

If for all $p>0$

$$\lim _{k \rightarrow \infty} \frac{\left\|\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^*\right\|}{\left\|\boldsymbol{x}^{(k)}-\boldsymbol{x}^*\right\|^p}=0,$$

then we say that the order of convergence is $\infty$. Note that in the definition above, $0 / 0$ should be understood to be $0$. The order of convergence of a sequence is a measure of its rate of convergence; the higher the order, the faster the rate of convergence. The order of convergence is sometimes also called the rate of convergence (see, e.g., [96]). If $p=1$ (first-order convergence) and $\lim_{k \rightarrow \infty}\left\|\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^*\right\| /\left\|\boldsymbol{x}^{(k)}-\boldsymbol{x}^*\right\|=1$, we say that the convergence is sublinear. If $p=1$ and $\lim_{k \rightarrow \infty}\left\|\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^*\right\| /\left\|\boldsymbol{x}^{(k)}-\boldsymbol{x}^*\right\|<1$, we say that the convergence is linear. If $p>1$, we say that the convergence is superlinear. If $p=2$ (second-order convergence), we say that the convergence is quadratic.
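
As a brief worked example (not from the text above), consider the scalar sequences $x^{(k)}=c^{k}$ and $x^{(k)}=c^{2^{k}}$ with $0<c<1$ and $x^*=0$:

$$\frac{\left|x^{(k+1)}\right|}{\left|x^{(k)}\right|}=\frac{c^{k+1}}{c^{k}}=c<1
\qquad\text{and}\qquad
\frac{\left|x^{(k+1)}\right|}{\left|x^{(k)}\right|^{2}}=\frac{c^{2^{k+1}}}{c^{2\cdot 2^{k}}}=1,$$

so the first sequence converges linearly (order $p=1$), while the second converges quadratically (order $p=2$).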